Abstract

In this paper we investigate the problem of partitioning an input
string T in such a way that compressing its parts individually via a base
compressor C yields a compressed output that is shorter than applying C
over the entire T at once. This problem was introduced in [2, 3] in the
context of table compression, and then further elaborated and extended
to strings and trees by [10, 11, 21]. Unfortunately, the literature offers
poor solutions: namely, we know either a cubic-time algorithm for
computing the optimal partition based on dynamic programming [3, 15], or
a few heuristics that do not guarantee any bounds on the efficacy of their
computed partition [2, 3], or algorithms that are efficient but work only in
some specific scenarios (such as the Burrows-Wheeler Transform, see e.g.
[10, 21]) and achieve compression performance that might be worse than
the optimal partitioning by a $\Omega(\sqrt{\log n})$ factor. Therefore, computing
the optimal solution efficiently is still open. In this paper we provide
the first algorithm which is guaranteed to compute in $O(n \log_{1+\epsilon} n)$ time
a partition of T whose compressed output is no more
than $(1+\epsilon)$ times the optimal one, where $\epsilon$ may be any positive constant.

1 Introduction 

Reorganizing data in order to improve the performance of a given compressor 
C is a recent and important paradigm in data compression (see e.g. [3 flU]). 
The basic idea consists of permuting the input data T to form a new string
T' which is then partitioned into substrings $T' = T_1 T_2 \cdots T_k$ that are finally
compressed individually by the base compressor C. The goal is to find the best
instantiation of the two steps Permuting+Partitioning so that the compression
of the individual substrings $T_i$ minimizes the total length of the compressed
output. This approach (hereafter abbreviated as PPC) is clearly at least as
powerful as the classic data compression approach that applies C to the entire
T: just take the identity permutation and set k = 1. The question is whether
it can be more powerful than that!

*This work has been partially supported by Yahoo! Research, Italian MIUR Italy-Israel FIRB
Project. The authors' address is Dipartimento di Informatica, L.go B. Pontecorvo 3, 56127
Pisa, Italy.

†Department of Computer Science, University of Pisa, emails: {ferragina, nitto,
rventurini}@di.unipi.it

Intuition suggests that it can: by grouping together objects
that are "related", one can hope to obtain better compression even using a very
weak compressor C. Surprisingly enough, this intuition has been sustained by
convincing theoretical and experimental results only recently. These results have 
investigated the PPC-paradigm under various angles by considering: different 
data formats (strings [TU], trees [TT], tables [S], etc.), different granularities for 
the items of T to be permuted (chars, node labels, columns, blocks files 
[i nulls], etc.), different permutations (see e.g. [H [HI [5] ) , different base 
compressors to be boosted (0-th order compressors, gzip, bzip2, etc.). Among
this plethora of proposals, we survey below the most notable examples which
are useful to introduce the problem we attack in this paper, and refer the reader 
to the cited bibliography for other interesting results. 

The PPC-paradigm was introduced in [2], and further elaborated upon in 
[3]. In these papers T is a table formed by fixed size columns, and the goal is 
to permute the columns in such a way that individually compressing contiguous 
groups of them gives the shortest compressed output. The authors of [3] showed 
that the PPC-problem in its full generality is MAX-SNP hard, devised a link
between PPC and the classical asymmetric TSP problem, and then resorted to
known heuristics to find approximate solutions based on several measures of
correlation between the table's columns. For the grouping they proposed
either an optimal but very slow approach, based on Dynamic Programming (see
below), or some very simple and fast algorithms which however did not have any
guaranteed bounds in terms of efficacy of their grouping process. Experiments
showed that these heuristics achieve significant improvements over the classic
gzip, when it is applied on the serialized original T (row- or column-wise).
Furthermore, they showed that the combination of the TSP-heuristic with the
DP-optimal partitioning is even better, but it is too slow to be used in practice
even on small file sizes because of the cubic time complexity of the DP.^1

When T is a text string, the most famous instantiation of the PPC-paradigm
has been obtained by combining the Burrows and Wheeler Transform [S] (shortly 
BWT) with a context-based grouping of the input characters, which are finally 
compressed via proper 0-th order-entropy compressors (like MTF, RLE, Huffman, 
Arithmetic, or their combinations, see e.g. [55]). Here the PPC-paradigm takes 
the name of compression booster [10] because the net result it produces is to
boost the performance of the base compressor C from 0-th order entropy bounds
to k-th order entropy bounds, simultaneously over all $k \geq 0$. In this scenario the
permutation acts on single characters, and the partitioning/permuting steps
exploit the context (substring) following each symbol in the original string in order
to identify "related" characters which must therefore be compressed together.
Recently [14] investigated whether there exist other permutations of the
characters of T which admit effective compression and can be computed/inverted fast.
Unfortunately they found a connection between table compression and the BWT,
so that many natural similarity-functions between contexts turned out to
induce MAX-SNP hard permuting problems! Interestingly enough, the BWT seems
to be the unique highly compressible permutation which is fast to compute
and achieves effective compression bounds. Several other papers have given
an analytic account of this phenomenon (221 [HI [13 [H] and have shown, also 
experimentally [8], that the partitioning of the BW-transformed data is a key
step for achieving effective compression ratios. Optimal partitioning is actually

^1 Page 836 of [3] says: "computing a good approximation to the TSP reordering before
partitioning contributes significant compression improvement at minimal time cost. [...] This
time is negligible compared to the time to compute the optimal, contiguous partition via DP."

even more mandatory in the context of labeled-tree compression where a BWT-inspired
transform, called XBW-transform in [11, 12], allows one to produce
permuted strings with a strong clustering effect. Starting from these premises [15]
attacked the computation of the optimal partitioning of T via a DP-approach,
which turned out to be very costly; then [10] (and subsequently many other authors,
see e.g. [nj[lT|[TT]) proposed solutions which are not optimal but, nonetheless, 
achieve interesting k-th order entropy bounds. This is indeed a subtle point
which is frequently neglected when dealing with compression boosters,
especially in practice, and for this reason we detail it more clearly in Appendix A,
in which we show an infinite class of strings for which the compression achieved
by the booster is far from the optimal partitioning by a multiplicative factor
$\Omega(\sqrt{\log n})$.

Finally, there is another scenario in which the computation of the optimal
partition of an input string for compression boosting can be successful: it
occurs when T is a single (possibly long) file on which we wish to apply classic
data compressors, such as gzip, bzip2, ppm, etc. [25]. Note that how much
redundancy can be detected and exploited by these compressors depends on
their ability to "look back" at the previously seen data. However, such ability
has a cost in terms of memory usage and running time, and thus most
compression systems provide a facility that controls the amount of data that may
be processed at once, usually called the block size. For example the classic
tools gzip and bzip2 have been designed to have a small memory footprint, up
to a few hundred KBs. More recent and sophisticated compressors, like ppm [21]
and the family of BWT-based compressors [8], have been designed to use block
sizes of up to a few hundred MBs. But using larger blocks to be compressed at
once does not necessarily induce a better compression ratio! As an example, let
us take C as the simple Huffman or Arithmetic coders and use them to compress
the text $T = 0^{n/2} 1^{n/2}$. There is a clear difference whether we compress
the two halves of T individually (achieving an output size of about $O(\log n)$ bits) or we
compress T as a whole (achieving $n + O(\log n)$ bits). The impact of the block
size is even more significant as we use more powerful compressors, such as the
k-th order entropy encoder ppm which compresses each symbol according to its
preceding k-long context. In this case take $T = (2^k 0)^{n/(2(k+1))} (2^k 1)^{n/(2(k+1))}$
and observe that if we divide T in two halves and compress them individually,
the output size is about $O(\log n)$ bits, but if we compress the entire T at once
then the output size turns out to be much longer, i.e. $\frac{n}{k+1} + O(\log n)$ bits.
Therefore the choice of the block size cannot be underestimated and, additionally, it
is made even more problematic by the fact that it is not necessarily the same
along the whole file we are compressing because it depends on the distribution
of the repetitions within it. This problem is even more challenging when T is
obtained by concatenating a collection of files via any permutation of them:
think of the serialization induced by the Unix tar command, or other more
sophisticated heuristics like the ones discussed in [221 [6l [211 [26] . In these cases, 
the partitioning step looks for homogeneous groups of contiguous files which can
be effectively compressed together by the base compressor C. More than before,
taking the largest memory footprint offered by C to group the files and compress
them at once is not necessarily the best choice because real collections are
typically formed by homogeneous groups of dramatically different sizes (e.g. think
of a Web collection and its different kinds of pages). Again, in all those cases
we could apply the optimal DP-based partitioning approach of [15, 3], but this
would take more than cubic time (in the overall input size |T|), thus being
unusable even on small input data of a few MBs!
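
The two-halves example for 0-th order coders discussed above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses the entropy bound $|s| H_0(s)$ as a stand-in for the output size of Huffman or Arithmetic coding (model costs are ignored), and the helper name `h0_bits` is ours.

```python
import math
from collections import Counter


def h0_bits(s: str) -> float:
    """Return |s| * H_0(s), the 0-th order entropy bound in bits (model cost ignored)."""
    n = len(s)
    return sum(c * math.log2(n / c) for c in Counter(s).values())


n = 1024
text = "0" * (n // 2) + "1" * (n // 2)

# Compressing T as a single block: every symbol costs one bit.
whole = h0_bits(text)

# Compressing the two halves individually: each half has a single symbol.
halves = h0_bits(text[: n // 2]) + h0_bits(text[n // 2:])

print(whole, halves)
```

Under this simplified cost model the whole text costs n bits while the two halves cost nothing beyond the model overhead, which mirrors the gap between $n + O(\log n)$ and $O(\log n)$ bits.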

In summary the efficient computation of an optimal partitioning of the input 
text for compression boosting is an important and still open problem of data 
compression (see [1]). The goal of this paper is to make a step forward by 
providing the first efficient approximation algorithm for this problem, formally 
stated as follows. 

Let C be the base compressor we wish to boost, and let T[1, n] be the input
string we wish to partition and then compress by C. So, we are assuming that T
has been (possibly) permuted in advance, and we are concentrating on the last
two steps of the PPC-paradigm. Now, given a partition $\mathcal{P}$ of the input string
into contiguous substrings, say $T = T_1 T_2 \cdots T_k$, we denote by $Cost(\mathcal{P})$ the cost
of this partition and measure it as $\sum_{i=1}^{k} |C(T_i)|$, where $|C(\alpha)|$ is the length in
bits of the string $\alpha$ compressed by C. The problem of optimally partitioning T
according to the base compressor C consists then of computing the partition
$\mathcal{P}_{opt}$ achieving the minimum cost, namely $\mathcal{P}_{opt} = \arg\min_{\mathcal{P}} Cost(\mathcal{P})$, and thus the
shortest compressed output.^2

As we mentioned above, $\mathcal{P}_{opt}$ might be computed via a Dynamic-Programming
approach [3, 15]. Define E[i] as the cost of the optimum partitioning of T[1, i],
and set E[0] = 0. Then, for each $i \geq 1$, we can compute E[i] as
$E[i] = \min_{0 \leq j \leq i-1} \left( E[j] + |C(T[j+1, i])| \right)$. At the end E[n] gives the cost of $\mathcal{P}_{opt}$, which can be explicitly
determined by standard back-tracking over the DP-array. Unfortunately, this
solution requires running C over $\Theta(n^2)$ substrings of average length $\Theta(n)$, for an
overall $\Theta(n^3)$ time cost in the worst case, which is clearly unfeasible even on
small input sizes n.
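
The recurrence above translates directly into code. The sketch below is only illustrative: `optimal_partition` and `toy_cost` are hypothetical names, and `toy_cost` (the bound $|s| H_0(s)$ plus an assumed fixed overhead of 4 bits per block) merely stands in for $|C(\cdot)|$. The two nested loops evaluate the cost on $\Theta(n^2)$ substrings, which is exactly the bottleneck discussed above.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def toy_cost(s: str) -> float:
    """Hypothetical stand-in for |C(s)|: |s| * H_0(s) plus 4 bits of overhead."""
    n = len(s)
    return 4.0 + sum(c * math.log2(n / c) for c in Counter(s).values())


def optimal_partition(text: str, cost: Callable[[str], float]) -> Tuple[float, List[str]]:
    """Return (E[n], parts) for the recurrence E[i] = min_j E[j] + cost(T[j+1, i])."""
    n = len(text)
    E = [0.0] + [math.inf] * n      # E[i]: optimal cost of the prefix of length i
    back = [0] * (n + 1)            # back[i]: the j achieving the minimum for E[i]
    for i in range(1, n + 1):
        for j in range(i):
            candidate = E[j] + cost(text[j:i])   # text[j:i] is T[j+1, i] (1-based)
            if candidate < E[i]:
                E[i], back[i] = candidate, j
    parts = []
    i = n
    while i > 0:                    # standard back-tracking over the DP array
        parts.append(text[back[i]:i])
        i = back[i]
    parts.reverse()
    return E[n], parts


best, parts = optimal_partition("0" * 8 + "1" * 8, toy_cost)
print(best, parts)
```

On this toy input the cheapest partition splits the text into its two homogeneous halves, in line with the block-size discussion of the Introduction.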

In order to overcome this computational bottleneck we make two crucial
observations: (1) instead of applying C over each substring of T, we use an
entropy-based estimation of C's compressed output that can be computed
efficiently and incrementally by suitable dynamic data structures; (2) we relax
the requirement for an exact solution to the optimal partitioning problem, and
aim at finding a partition whose cost is no more than $(1+\epsilon)$ times that of $\mathcal{P}_{opt}$,
where $\epsilon$ may be any positive constant. Item (1) takes inspiration from the
heuristics proposed in [2, 3], but it is executed in a more principled way because
our entropy-based cost functions reflect the real behavior of modern
compressors, and our dynamic data structures allow the efficient estimation of those
costs without their re-computation from scratch at each substring (as instead
occurred in [2, 3]). Item (2) boils down to showing that the optimal partitioning
problem can be rephrased as a Single Source Shortest Path (SSSP) computation over
a weighted DAG consisting of n nodes and $O(n^2)$ edges whose costs are derived
from item (1). We prove some interesting structural properties of this graph
that allow us to restrict the computation of that SSSP to a subgraph
consisting of $O(n \log_{1+\epsilon} n)$ edges only. The technical part of this paper (see Section
3) will show that we can build this graph on-the-fly as the SSSP-computation
proceeds over the DAG via the proper use of time-space efficient dynamic data
structures. The final result will be to show that we can $(1+\epsilon)$-approximate
$\mathcal{P}_{opt}$ in $O(n \log_{1+\epsilon} n)$ time and $O(n)$ space, for both 0-th order compressors
(like Huffman and Arithmetic [28]) and k-th order compressors (like ppm [28]).

We will also extend these results to the class of BWT-based compressors, when T
is a collection of texts.

^2 We are assuming that $C(\alpha)$ is a prefix-free encoding of $\alpha$, so that we can concatenate the
compressed output of many substrings and still be able to recover them via a sequential scan.

We point out that the result on 0-th order compressors is interesting in its
own right from both the experimental side, since the Huffword compressor is the
standard choice for the storage of Web pages [28], and from the theoretical side,
since it can be applied to the compression booster of [10] to quickly obtain an
approximation of the optimal partition of BWT(T) in $O(n \log_{1+\epsilon} n)$ time. This
may be better than the algorithm of [10] both in time complexity, since that
takes $O(n|\Sigma|)$ time where $\Sigma$ is the alphabet of T, and in compression ratio (as
we have shown above, see Appendix A). The case of a large alphabet (namely,
$|\Sigma| = \Omega(\mathrm{polylog}(n))$) is particularly interesting whenever we consider either
a word-based BWT j23| or the XBW-transform over labeled trees [TU]. Finally, 
we mention that our results apply also to the practical case in which the base
compressor C has a maximum (block) size B of data it can process at once (see
above the case of gzip, bzip2, etc.). In this situation the time performance of
our solution reduces to $O(n \log_{1+\epsilon}(B \log \sigma))$.

The map of the paper is as follows. Section 2 introduces some basic notation
and terminology. Section 3 describes our reduction from the optimal partitioning
problem of T to an SSSP problem over a weighted DAG in which edges represent
substrings of T and edge costs are entropy-based estimations of the compression
of these substrings via C. The subsequent sections will address the problem of
incrementally and efficiently computing those edge costs as they are needed
by the SSSP-computation, distinguishing the two cases of 0-th order estimators
(Section 4) and k-th order estimators (Section 5), and the situation in which C
is a BWT-based compressor and T is a collection of files (Section 6).

2 Notation 

In this paper we will use entropy-based upper bounds for the estimation of
$|C(T[i, j])|$, so we need to recall some basic notation and terminology about
entropies. Let T[1, n] be a string drawn from the alphabet $\Sigma$ of size $\sigma$. For each
$c \in \Sigma$, we let $n_c$ be the number of occurrences of c in T. The zero-th order
empirical entropy of T is defined as $H_0(T) = \frac{1}{|T|} \sum_{c \in \Sigma} n_c \log \frac{n}{n_c}$.

Recall that $|T| H_0(T)$ provides an information-theoretic lower bound to the
output size of any compressor that encodes each symbol of T with a fixed
code [21]. The so-called zero-th order statistical compressors (such as Huffman
or Arithmetic [28]) achieve an output size which is very close to this bound.
However, they need to know the frequencies of the input symbols
(called the model of the source). Those frequencies can be either known in
advance (static model) or computed by scanning the input text (semistatic model).
In both cases the model must be stored in the compressed file to be used by the
decompressor.

In the following we will bound the compressed size achieved by zero-th order
compressors over T by $|C_0(T)| \leq \lambda n H_0(T) + f_0(n, \sigma)$ bits, where $\lambda$ is a positive
constant and $f_0(n, \sigma)$ is a function including the extra costs of encoding the
source model and/or other inefficiencies of C. We will also assume
that the function $f_0(n, \sigma)$ can be computed in constant time given n and $\sigma$.
As an example, for Huffman $f_0(n, \sigma) = \sigma \log \sigma + n$ bits and $\lambda = 1$, and for
Arithmetic $f_0(n, \sigma) = \sigma \log n + \log n / n$ bits and $\lambda = 1$.

In order to avoid the cost of the model, we can resort to zero-th order
adaptive compressors that do not need to know the symbols' frequencies in
advance, since they are computed incrementally during the compression. The
zero-th order adaptive empirical entropy of T |16j is then defined as Hq{T) = 
]¥] Sees ■'^^S ^ ^i^l bound the compress size achieved by zero-th order 
adaptive compressors over T by $|C_0^a(T)| = n H_0^a(T)$ bits.

Let us now come to more powerful compressors. For any string u of length
k, we denote by $u_T$ the string of single symbols following the occurrences of u
in T, taken from left to right. For example, if T = mississippi and u = si, we
have $u_T$ = sp since the two occurrences of si in T are followed by the symbols
s and p, respectively. The k-th order empirical entropy of T is defined as
$H_k(T) = \frac{1}{|T|} \sum_{u \in \Sigma^k} |u_T| H_0(u_T)$. Analogously, the k-th order adaptive empirical
entropy of T is defined as $H_k^a(T) = \frac{1}{|T|} \sum_{u \in \Sigma^k} |u_T| H_0^a(u_T)$.
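
The definitions above can be made concrete with a short sketch. The function names (`h0`, `following_symbols`, `hk`) are ours and logarithms are taken in base 2; the code follows the definition of $u_T$ given in the text, that is, it collects the symbols following each occurrence of a context. Only the non-adaptive entropies are covered.

```python
import math
from collections import Counter, defaultdict


def h0(s: str) -> float:
    """Zero-th order empirical entropy H_0(s), in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(c * math.log2(n / c) for c in Counter(s).values()) / n


def following_symbols(text: str, u: str) -> str:
    """The string u_T: symbols following the occurrences of u in text, left to right."""
    k = len(u)
    return "".join(text[i + k] for i in range(len(text) - k) if text[i:i + k] == u)


def hk(text: str, k: int) -> float:
    """k-th order empirical entropy: (1/|T|) * sum over contexts u of |u_T| * H_0(u_T)."""
    followers = defaultdict(list)
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    return sum(len(f) * h0("".join(f)) for f in followers.values()) / len(text)


print(following_symbols("mississippi", "si"))
print(h0("aabb"), hk("abababab", 1))
```

For k = 0 the only context is the empty string, so `hk(text, 0)` coincides with `h0(text)`.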

We have $H_k(T) \geq H_{k+1}(T)$ for any $k \geq 0$. As usual in data compression
[22], the value $n H_k(T)$ is an information-theoretic lower bound to the output
size of any compressor that encodes each symbol of T with a fixed code that
depends on the symbol itself and on the k immediately preceding symbols.
Recently (see e.g. [HI [551 dHl [211 [E] and refs therein) authors have provided 
upper bounds in terms of $H_k(T)$ for sophisticated data-compression algorithms,
such as gzip [18], bzip2 [22l [TOl [E] , and ppm. These bounds have the form 
$|C(T)| \leq \lambda |T| H_k(T) + f_k(|T|, \sigma)$, where $\lambda$ is a positive constant and $f_k(|T|, \sigma)$ is
a function including the extra cost of encoding the source model and/or other
inefficiencies of C. The smaller $\lambda$ and $f_k()$ are, the better the compressor C is.
As an example, the bound of the compressor in [31] has $\lambda = 1$ and
$f_k(|T|, \sigma) = O(\sigma^{k+1} \log |T| + |T| \log \sigma \log\log |T| / \log |T|)$. Similar bounds that involve the
adaptive k-th order entropy are known [22, 10, 9] for many compressors. In
these cases the bound takes the form $|C_k^a(T)| = \lambda |T| H_k^a(T) + g_k(\sigma)$ bits, where
the value of $g_k$ depends only on the alphabet size $\sigma$.

In our paper we will use these entropy-based bounds for the estimation of
$|C(T[i, j])|$, but of course this will not be enough to achieve a fast DP-based
algorithm for our optimal-partitioning problem. We cannot re-compute from
scratch those estimates for every substring T[i, j] of T, since there are $\Theta(n^2)$ of
them. So we will show some structural properties of our problem (Section
3) and introduce a few novel technicalities (in the subsequent sections) that will allow us to
compute $H_k(T[i, j])$ only on a reduced subset of T's substrings, having size
$O(n \log_{1+\epsilon} n)$, by taking $O(\mathrm{polylog}(n))$ time per substring and $O(n)$ space
overall.

3 The problem and our solution 

The optimal partitioning problem, stated in Section 1, can be reduced to a
single source shortest path computation (SSSP) over a directed acyclic graph
$\mathcal{G}(T)$ defined as follows. The graph $\mathcal{G}(T)$ has a vertex $v_i$ for each text position
i of T, plus an additional vertex $v_{n+1}$ marking the end of the text, and an
edge connecting vertex $v_i$ to vertex $v_j$ for any pair of indices i and j such that
i < j. Each edge $(v_i, v_j)$ has associated the cost $c(v_i, v_j) = |C(T[i, j-1])|$ that
corresponds to the size in bits of the substring T[i, j-1] compressed by C. We
remark the following crucial, but easy to prove, property of the cost function
defined on $\mathcal{G}(T)$:

Fact 1 For any vertex $v_i$, it is $0 < c(v_i, v_{i+1}) \leq c(v_i, v_{i+2}) \leq \cdots \leq c(v_i, v_{n+1})$.

There is a one-to-one correspondence between paths from $v_1$ to $v_{n+1}$ in
$\mathcal{G}(T)$ and partitions of T: every edge $(v_i, v_j)$ in the path identifies a contiguous
substring T[i, j-1] of the corresponding partition. Therefore the cost of a path
is equal to the (compression-)cost of the corresponding partition. Thus, we can
find the optimal partition of T by computing the shortest path in $\mathcal{G}(T)$ from $v_1$
to $v_{n+1}$. Unfortunately this simple approach has two main drawbacks:

1. the number of edges in $\mathcal{G}(T)$ is $\Theta(n^2)$, thus making the SSSP computation
inefficient (i.e. $\Omega(n^2)$ time) if executed directly over $\mathcal{G}(T)$;

2. the computation of each edge cost might take $\Theta(n)$ time over most of
T's substrings, if C is run on each of them from scratch.

In the following sections we will successfully address both these
drawbacks. First, we significantly reduce the number of edges in the graph $\mathcal{G}(T)$ to be
examined during the SSSP computation and show that we can obtain a $(1+\epsilon)$
approximation using only $O(n \log_{1+\epsilon} n)$ edges, where $\epsilon > 0$ is a user-defined
parameter (Section 3.1). Second, we show some sufficient properties that C needs
to satisfy in order to compute efficiently every edge's cost. These properties
hold for some well-known compressors (e.g. 0-order compressors, PPM-like and
bzip-like compressors) and for them we show how to compute each edge cost
in constant or polylogarithmic time (see the subsequent sections).

3.1 A pruning strategy 

The aim of this section is to design a pruning strategy that produces a subgraph
$\mathcal{G}_\epsilon(T)$ of the original DAG $\mathcal{G}(T)$ in which the shortest path distance between its
leftmost and rightmost nodes, $v_1$ and $v_{n+1}$, increases by no more than a factor
$(1+\epsilon)$. We define $\mathcal{G}_\epsilon(T)$ to contain all edges $(v_i, v_j)$ of $\mathcal{G}(T)$, recall i < j, such
that at least one of the following two conditions holds:

1. there exists a positive integer k such that $c(v_i, v_j) \leq (1+\epsilon)^k < c(v_i, v_{j+1})$;

2. $j = n + 1$.

In other words, by Fact 1 we are keeping for each integer k the edge of $\mathcal{G}(T)$
that best approximates the value $(1+\epsilon)^k$ from below. Given this, we
will call $\epsilon$-maximal the edges of $\mathcal{G}_\epsilon(T)$. Clearly, each vertex of $\mathcal{G}_\epsilon(T)$ has at
most $\log_{1+\epsilon} n = O(\frac{1}{\epsilon} \log n)$ outgoing edges, which are $\epsilon$-maximal by definition.
Therefore the total size of $\mathcal{G}_\epsilon(T)$ is at most $O(\frac{n}{\epsilon} \log n)$. Hereafter, we will denote
with $d_G(-, -)$ the shortest path distance between any two nodes in a graph G.
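
The two conditions defining the $\epsilon$-maximal edges can be illustrated by a naive enumeration for a single vertex. This is only a sketch under stated assumptions: the function name `eps_maximal_targets` and the toy cost are ours, the cost is assumed to be positive and non-decreasing in j (as in Fact 1), and the scan evaluates the cost on every candidate edge, which is precisely what an efficient generation of the pruned graph must avoid.

```python
from typing import Callable, List


def eps_maximal_targets(i: int, n: int, cost: Callable[[int, int], float], eps: float) -> List[int]:
    """Indices j such that (v_i, v_j) is eps-maximal, assuming cost(i, j) is
    positive and non-decreasing in j (Fact 1). Vertices are numbered 1..n+1."""
    targets: List[int] = []
    j = i          # j == i means that no edge fits under the current threshold yet
    k = 1
    while j < n + 1:
        threshold = (1 + eps) ** k
        # advance j to the largest index whose edge cost is still <= (1+eps)^k
        while j < n + 1 and cost(i, j + 1) <= threshold:
            j += 1
        if j > i and (not targets or targets[-1] != j):
            targets.append(j)      # condition 1, or condition 2 when j == n + 1
        k += 1
    return targets


# Toy cost (an assumption for illustration): one bit per symbol of T[i, j-1].
print(eps_maximal_targets(1, 10, lambda i, j: float(j - i), 1.0))
```

The loop stops as soon as $v_{n+1}$ is reached, so the last returned index is always n + 1, matching condition 2.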

The following lemma states a basic property of shortest path distances over
our special DAG $\mathcal{G}(T)$:

Lemma 1 For any triple of indices $1 \leq i \leq j \leq q \leq n+1$ we have:

1. $d_{\mathcal{G}(T)}(v_j, v_q) \leq d_{\mathcal{G}(T)}(v_i, v_q)$

2. $d_{\mathcal{G}(T)}(v_i, v_j) \leq d_{\mathcal{G}(T)}(v_i, v_q)$

Proof: We prove just 1, since 2 is symmetric. It suffices by induction to prove
the case j = i + 1. Let $(v_i, w_1)(w_1, w_2) \cdots (w_{h-1}, w_h)$, with $w_h = v_q$, be a shortest
path in $\mathcal{G}(T)$ from $v_i$ to $v_q$. By Fact 1, $c(v_j, w_1) \leq c(v_i, w_1)$ since $i \leq j$. Therefore
the cost of the path $(v_j, w_1)(w_1, w_2) \cdots (w_{h-1}, w_h)$ is at most $d_{\mathcal{G}(T)}(v_i, v_q)$, which
proves the claim. ■

The correctness of our pruning strategy relies on the following theorem:

Theorem 1 For any text T, the shortest path in $\mathcal{G}_\epsilon(T)$ from $v_1$ to $v_{n+1}$ has a
total cost of at most $(1+\epsilon)\, d_{\mathcal{G}(T)}(v_1, v_{n+1})$.

Proof: We prove a stronger assertion: $d_{\mathcal{G}_\epsilon(T)}(v_i, v_{n+1}) \leq (1+\epsilon)\, d_{\mathcal{G}(T)}(v_i, v_{n+1})$
for any index $1 \leq i \leq n+1$. This is clearly true for i = n + 1, because
in that case the distance is 0. Now let us inductively consider the shortest
path $\pi$ in $\mathcal{G}(T)$ from $v_i$ to $v_{n+1}$ and let $(v_i, v_{t_1})(v_{t_1}, v_{t_2}) \cdots (v_{t_h}, v_{n+1})$ be its
edges. By the definition of $\epsilon$-maximal edge, it is possible to find an $\epsilon$-maximal
edge $(v_i, v_r)$ with $t_1 \leq r$, such that $c(v_i, v_r) \leq (1+\epsilon)\, c(v_i, v_{t_1})$. By Lemma 1,
$d_{\mathcal{G}(T)}(v_r, v_{n+1}) \leq d_{\mathcal{G}(T)}(v_{t_1}, v_{n+1})$. By induction,
$d_{\mathcal{G}_\epsilon(T)}(v_r, v_{n+1}) \leq (1+\epsilon)\, d_{\mathcal{G}(T)}(v_r, v_{n+1})$. Combining this with the triangle inequality we get the thesis.
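
For the reader's convenience, the last step of the proof can be unfolded as the chain of inequalities below. This is our expansion of the argument, writing $v_i$ for the first vertex of the path $\pi$, $v_{t_1}$ for its second vertex, $v_r$ for the endpoint of the chosen $\epsilon$-maximal edge, and abbreviating $\mathcal{G} = \mathcal{G}(T)$ and $\mathcal{G}_\epsilon = \mathcal{G}_\epsilon(T)$. The first line is the triangle inequality in $\mathcal{G}_\epsilon$, the second uses the choice of the edge and the inductive hypothesis, the third uses Lemma 1, and the last holds because $(v_i, v_{t_1})$ is the first edge of a shortest path.

```latex
\begin{align*}
d_{\mathcal{G}_\epsilon}(v_i, v_{n+1})
  &\le c(v_i, v_r) + d_{\mathcal{G}_\epsilon}(v_r, v_{n+1}) \\
  &\le (1+\epsilon)\, c(v_i, v_{t_1}) + (1+\epsilon)\, d_{\mathcal{G}}(v_r, v_{n+1}) \\
  &\le (1+\epsilon) \bigl( c(v_i, v_{t_1}) + d_{\mathcal{G}}(v_{t_1}, v_{n+1}) \bigr) \\
  &= (1+\epsilon)\, d_{\mathcal{G}}(v_i, v_{n+1}).
\end{align*}
```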



3.2 Space and time efficient algorithms for generating $\mathcal{G}_\epsilon(T)$

Theorem 1 ensures that, in order to compute a $(1+\epsilon)$ approximation of the
optimal partition of T, it suffices to compute the SSSP in $\mathcal{G}_\epsilon(T)$ from $v_1$ to
$v_{n+1}$. This can be easily computed in $O(|\mathcal{G}_\epsilon(T)|) = O(n \log_{1+\epsilon} n)$ time since
$\mathcal{G}_\epsilon(T)$ is a DAG [7], by making a single pass over its vertices and relaxing all
edges going out from the current one.
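
The single-pass relaxation over a DAG whose vertices are already in topological order can be sketched as follows. The function name `dag_shortest_path` and the callback `out_edges` are hypothetical; the callback is assumed to enumerate the outgoing edges of a vertex (for instance, the $\epsilon$-maximal ones) together with their costs.

```python
import math
from typing import Callable, Iterable, List, Tuple


def dag_shortest_path(
    n: int, out_edges: Callable[[int], Iterable[Tuple[int, float]]]
) -> Tuple[float, List[int]]:
    """Shortest path from v_1 to v_{n+1}; out_edges(i) yields pairs (j, cost) with j > i."""
    dist = [math.inf] * (n + 2)     # index 0 is unused
    pred = [0] * (n + 2)
    dist[1] = 0.0
    for i in range(1, n + 1):       # vertices in left-to-right (topological) order
        if dist[i] == math.inf:
            continue                # v_i is not reachable from v_1
        for j, c in out_edges(i):
            if dist[i] + c < dist[j]:
                dist[j], pred[j] = dist[i] + c, i
    if dist[n + 1] == math.inf:
        raise ValueError("v_{n+1} is not reachable")
    path = [n + 1]
    while path[-1] != 1:            # follow predecessors back to v_1
        path.append(pred[path[-1]])
    path.reverse()
    return dist[n + 1], path


# Toy instance (assumed costs): every edge (v_i, v_j) costs 3 + (j - i).
edges = lambda i: [(j, 3.0 + (j - i)) for j in range(i + 1, 6)]
print(dag_shortest_path(4, edges))
```

Consecutive vertices $v_i, v_j$ on the returned path identify the block T[i, j-1] of the corresponding partition.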

However, generating $\mathcal{G}_\epsilon(T)$ efficiently is a non-trivial task for three
main reasons. First, the original graph $\mathcal{G}(T)$ contains $\Omega(n^2)$ edges, so that
we cannot check each of them to determine whether it is $\epsilon$-maximal or not,
because this would take $\Omega(n^2)$ time. Second, we cannot compute the cost of an
edge $(v_i, v_j)$ by executing $C(T[i, j-1])$ from scratch, since this would require
time linear in the substring length, and thus $\Omega(n^3)$ time over all T's substrings.
Third, we cannot materialize $\mathcal{G}_\epsilon(T)$ (e.g. its adjacency lists) because it consists
of $\Theta(n\, \mathrm{polylog}(n))$ edges, and thus its space occupancy would be super-linear
in the input size.

The rest of this section is devoted to designing an algorithm which overcomes the
three limitations above. The distinctive feature of our algorithm is that it materializes
$\mathcal{G}_\epsilon(T)$ on-the-fly, as its vertices are examined during the SSSP-computation, by
spending only polylogarithmic time per edge. The actual time complexity per
edge will depend on the entropy-based cost function we will use to estimate
$|C(T[i, j-1])|$ (see Section 2) and on the dynamic data structure we will deploy
to compute that estimation efficiently.

The key tool we use to make a fast estimation of the edge costs is a dynamic
data structure built over the input text T and requiring $O(|T|)$ space. We state
the main properties of this data structure in an abstract form, in order to design
a general framework for solving our problem; in the next sections we will then
provide implementations of this data structure and thus obtain real time/space
bounds for our problem. So, let us assume to have a dynamic data structure
that maintains a set of sliding windows over T denoted by $w_1, w_2, \ldots, w_{\log_{1+\epsilon} n}$.
The sliding windows are substrings of T which start at the same text position
l but have different lengths: namely, $w_i = T[l, r_i]$ and $r_1 \leq r_2 \leq \ldots \leq r_{\log_{1+\epsilon} n}$.
The data structure must support the following three operations:

1. Remove() moves the starting position l of all windows one position to the
right (i.e. l + 1);

2. Append($w_i$) moves the ending position of the window $w_i$ one position to
the right (i.e. $r_i + 1$);

3. Size($w_i$) computes and returns the value $|C(T[l, r_i])|$.

This data structure is enough to generate $\epsilon$-maximal edges via a single pass
over T, using $O(|T|)$ space. More precisely, let $v_l$ be the vertex of $\mathcal{G}(T)$ currently
examined by our SSSP computation, and thus l is the current position reached
by our scan of T. We maintain the following invariant: the sliding windows
correspond to all $\epsilon$-maximal edges going out from $v_l$, that is, the edge $(v_l, v_{r_t+1})$
is the $\epsilon$-maximal edge satisfying $c(v_l, v_{r_t+1}) \leq (1+\epsilon)^t < c(v_l, v_{r_t+2})$. Initially
all indices are set to 0. To maintain the invariant, when the text scan advances
to the next position l + 1, we call operation Remove() once to increment index l
and, for each $t = 1, \ldots, \log_{1+\epsilon} n$, we call operation Append($w_t$) until we find the
largest $r_t$ such that Size($w_t$) $= c(v_l, v_{r_t+1}) \leq (1+\epsilon)^t$. The key issue here is that
Append and Size are paired so that our data structure should take advantage
of the rightward sliding of $r_t$ for computing $c(v_l, v_{r_t+1})$ efficiently. Just one
character is entering $w_t$ to its right, so we need to exploit this fact for making
the computation of Size($w_t$) fast (given its previous value). Here comes into
play the second contribution of our paper that consists of adopting the
entropy-bounded estimates for the compressibility of a string, mentioned in Section 2,
to estimate the edge costs Size($w_t$) $= |C(w_t)|$. This idea is crucial
because we will be able to show that these functions do satisfy some structural
properties that admit a fast incremental computation, as the one required by
Append + Size. These issues will be discussed in the following sections; here we
just state that, overall, the SSSP computation over $\mathcal{G}_\epsilon(T)$ takes $O(n)$ calls to
operation Remove, and $O(n \log_{1+\epsilon} n)$ calls to operations Append and Size.

Theorem 2 If we have a dynamic data structure occupying $O(n)$ space and
supporting operation Remove in time L(n), and operations Append and Size in
time R(n), then we can compute the shortest path in $\mathcal{G}_\epsilon(T)$ from $v_1$ to $v_{n+1}$
taking $O(n\, L(n) + (n \log_{1+\epsilon} n)\, R(n))$ time and $O(n)$ space.
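
The scan behind Theorem 2 can be sketched with a deliberately naive reference structure. Everything named here is an assumption made for illustration: `toy_cost` stands in for $|C(\cdot)|$ (the bound $|s| H_0(s)$ plus an assumed 8-bit overhead), `NaiveWindows` recomputes Size from scratch instead of incrementally, positions are 0-based, and the `lookahead` argument (peeking at the cost after one more Append) is our addition to the three abstract operations.

```python
import math
from collections import Counter


def toy_cost(s: str) -> float:
    """Hypothetical stand-in for |C(s)|: |s| * H_0(s) plus 8 bits of model overhead."""
    n = len(s)
    return 8.0 + sum(c * math.log2(n / c) for c in Counter(s).values())


class NaiveWindows:
    """Reference structure: windows text[l:r[t]] (0-based, end exclusive).
    Size recomputes the cost from scratch, so it is slow but easy to check."""

    def __init__(self, text, num_windows, cost):
        self.text, self.cost = text, cost
        self.l = 0
        self.r = [0] * num_windows

    def remove(self):
        self.l += 1
        self.r = [max(r, self.l) for r in self.r]   # keep empty windows well formed

    def append(self, t):
        self.r[t] += 1

    def size(self, t, lookahead=0):
        # lookahead=1 peeks at the cost after one more Append (our addition)
        return self.cost(self.text[self.l:self.r[t] + lookahead])


def approx_partition_cost(text, eps, cost=toy_cost):
    """Cost of a partition found by relaxing only the eps-maximal edges."""
    n = len(text)
    # enough windows for the thresholds (1+eps)^t to reach the cost of the whole text
    num = max(1, math.ceil(math.log(cost(text), 1 + eps)))
    win = NaiveWindows(text, num, cost)
    dist = [math.inf] * (n + 1)     # dist[p]: best cost found for the prefix of length p
    dist[0] = 0.0
    for l in range(n):
        if l > 0:
            win.remove()
        for t in range(num):
            bound = (1 + eps) ** (t + 1)
            while win.r[t] < n and win.size(t, lookahead=1) <= bound:
                win.append(t)
            if win.r[t] > l:        # relax the eps-maximal edge ending at r[t]
                dist[win.r[t]] = min(dist[win.r[t]], dist[l] + win.size(t))
        dist[n] = min(dist[n], dist[l] + cost(text[l:]))    # edge to the last vertex
    return dist[n]


print(approx_partition_cost("aaaaaaaabbbbbbbbabababab", 0.25))
```

Each window end only moves rightward during the scan, which is what makes the number of Append and Size calls small; replacing `NaiveWindows` with an incremental structure is the subject of the sections that follow.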

4 On zero-th order compressors 

In this section we explain how to implement the data structure above whenever
C is a 0-th order compressor, and thus $H_0$ is used to provide a bound to the
compression cost of $\mathcal{G}(T)$'s edges (see Section 2). The key point is actually to
show how to efficiently compute Size($w_i$) as the sum of
$|T[l, r_i]|\, H_0(T[l, r_i]) = \sum_{c \in \Sigma} n_c \log((r_i - l + 1)/n_c)$ (see its definition in Section 2)
plus $f_0(r_i - l + 1, |\Sigma_{T[l, r_i]}|)$, where $n_c$ is the number of occurrences of symbol c in $T[l, r_i]$ and
$|\Sigma_{T[l, r_i]}|$ denotes the number of different symbols in $T[l, r_i]$.

The first solution we are going to present is very simple and uses $O(\sigma)$
space per window. The idea is the following: for each window $w_i$ we keep in
memory an array of counters $A_i[c]$ indexed by symbol c in $\Sigma$. At any step of
our algorithm, the counter $A_i[c]$ stores the number of occurrences of symbol c
in $T[l, r_i]$. For any window $w_i$, we also use a variable $E_i$ that stores the value
of $\sum_{c \in \Sigma} A_i[c] \log A_i[c]$. It is easy to notice that:

$|T[l, r_i]|\, H_0(T[l, r_i]) = (r_i - l + 1) \log(r_i - l + 1) - E_i$. (1)

Therefore, if we know the value of $E_i$, we can answer a query Size($w_i$)
in constant time. So, we are left with showing how to implement efficiently the
two operations that modify l or any $r_i$ value and, thus, modify appropriately
the value of $E_i$. This can be done as follows:

1. Remove (): For each window Wi, we subtract from the appropriate counter 
and from variable Ei the contribution of the symbol T[l] which has been 
evicted from the window. That is, we decrease A,;[T[/]] by one, and up- 
date Ei by subtracting + 1) log(Ai[T[/]] + 1) and then summing 
A,[T[l]] log A,[r[/]]. Finally we set I = I + 1. 

2. Append($w_i$): We add to the appropriate counter and variable $E_i$ the
contribution of the symbol $T[r_i + 1]$ which has been appended to window $w_i$.
That is, we increase $A_i[T[r_i + 1]]$ by one, then we update $E_i$ by
subtracting $(A_i[T[r_i + 1]] - 1) \log(A_i[T[r_i + 1]] - 1)$ and summing
$A_i[T[r_i + 1]] \log A_i[T[r_i + 1]]$ (with the convention $0 \log 0 = 0$).
Finally we set $r_i = r_i + 1$.
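
As an illustration of this simple solution, the two update rules can be
sketched in Python as follows. This is only a minimal sketch under our own
assumptions: positions are 0-based, the names NaiveWindows, append, remove and
size are hypothetical, size returns only the entropy term of Eqn. (1) and omits
the model cost $f_0$, and a window that is empty when Remove is called simply
stays empty.

```python
import math
from collections import Counter

def xlogx(a):
    """a * log2(a), with the convention 0 * log 0 = 0."""
    return a * math.log2(a) if a > 0 else 0.0

class NaiveWindows:
    """All windows start at l; window i ends at r[i] (inclusive, 0-based)."""

    def __init__(self, text, num_windows):
        self.T, self.l = text, 0
        self.r = [-1] * num_windows                       # r[i] = l - 1: empty
        self.A = [Counter() for _ in range(num_windows)]  # counters A_i
        self.E = [0.0] * num_windows                      # E_i = sum_c A_i[c] log A_i[c]

    def append(self, i):
        c = self.T[self.r[i] + 1]
        self.A[i][c] += 1
        a = self.A[i][c]
        self.E[i] += xlogx(a) - xlogx(a - 1)              # item 2
        self.r[i] += 1

    def remove(self):
        c = self.T[self.l]
        for i in range(len(self.r)):
            if self.r[i] < self.l:                        # empty window stays empty
                self.r[i] = self.l
                continue
            self.A[i][c] -= 1
            a = self.A[i][c]
            self.E[i] += xlogx(a) - xlogx(a + 1)          # item 1
        self.l += 1

    def size(self, i):
        # Entropy term of Eqn. (1) only; the model cost f_0 is not included.
        return xlogx(self.r[i] - self.l + 1) - self.E[i]
```

Each window pays for its own counter table here, which is exactly the
$O(\sigma)$ space per window discussed in the text.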

In this way, operation Remove requires constant time per window, hence
$O(\log_{1+\epsilon} n)$ time overall. Append($w_i$) takes constant time. The
space required by the counters $A_i$ is $O(\sigma \log_{1+\epsilon} n)$ words.
Unfortunately, the space complexity of this solution can be too much when it is
used as the basic block for computing the $k$-th order entropy of $T$ (see
Section 2) as we will do in Section 5. In fact, we would achieve
$\min(\sigma^{k+1} \log_{1+\epsilon} n, n \log_{1+\epsilon} n)$ space, which may
be superlinear in $n$ depending on $\sigma$ and $k$.

The rest of this section is therefore devoted to providing an implementation
of our dynamic data structure that takes the same time as above for these
three operations, but within $O(n)$ space, which is independent of $\sigma$ and
$k$. The new solution still uses the $E_i$'s values but the counters $A_i$ are
computed on-the-fly by exploiting the fact that all windows share the same value
of $l$. We keep an array $B$ indexed by symbols whose entry $B[c]$ stores the
number of occurrences of $c$ in the prefix $T[1, l-1]$ that precedes the
windows. We can keep these counters updated after a Remove by simply increasing
$B[T[l]]$ by one before $l$ is advanced. We also maintain an array $R$ with an
entry for each text position. The entry $R[j]$ stores the number of occurrences
of symbol $T[j]$ in $T[1, j]$. The number of elements in both $B$ and $R$ is no
more than $n$, hence they take $O(n)$ space.

These two arrays are enough to correctly update the value $E_i$ after
Append($w_i$), which is in turn enough to estimate $H_0$ (see Eqn. 1). In fact,
we can compute the value $A_i[T[r_i + 1]]$ by computing
$R[r_i + 1] - B[T[r_i + 1]]$ which correctly reports the number of occurrences
of $T[r_i + 1]$ in $T[l \ldots r_i + 1]$. Once we have the value of
$A_i[T[r_i + 1]]$, we can update $E_i$ as explained in the above item 2.

We are left with showing how to support Remove(), whose computation requires
evaluating the value of $A_i[T[l]]$ for each window $w_i$. Each of these values
can be computed as $R[t] - B[T[l]]$ where $t$ is the last occurrence of symbol
$T[l]$ in $T[l, r_i]$. The problem here is given by the fact that we do not know
the position $t$. We solve this issue by resorting to a doubly linked list $L_c$
for each symbol $c$. The list $L_c$ links together the last occurrences of $c$
in all those windows, ordered by increasing position. Notice that a position $j$
may be the last occurrence of symbol $T[j]$ for different (but consecutive)
windows. In this case we force that position to occur in $L_{T[j]}$ just once.
These lists are sufficient to compute values $A_i[T[l]]$ for all the windows
together. In fact, since any position in $L_{T[l]}$ is the last occurrence of
at least one sliding window, each of them can be used to compute $A_i[T[l]]$ for
the appropriate indices $i$. Once we have all values $A_i[T[l]]$, we can update
all $E_i$'s as explained in the above item 1. Since list $L_{T[l]}$ contains no
more than $\log_{1+\epsilon} n$ elements, all $E_i$'s can be updated in
$O(\log_{1+\epsilon} n)$ time. Notice that the number of elements in all the
lists $L$ is bounded by the text length. Thus, they are stored using $O(n)$
space.

It remains to explain how to keep lists $L$ correctly updated. Notice that
only one list may change after a Remove() or an Append($w_i$). In the former
case we have possibly to remove position $l$ from list $L_{T[l]}$. This
operation is simple because, if that position is in the list, then $T[l]$ is
the last occurrence of that symbol in $w_1$ (recall that all the windows start
at position $l$, and are kept ordered by increasing ending position) and, thus,
it must be the head of $L_{T[l]}$. The case of Append($w_i$) is more involved.
Since the ending position of $w_i$ is moved to the right, position $r_i + 1$
becomes the last occurrence of symbol $T[r_i + 1]$ in $w_i$. Recall that
Append($w_i$) inserts symbol $T[r_i + 1]$ in $w_i$. Thus, it must be inserted
in $L_{T[r_i+1]}$ in its correct (sorted) position, if it is not present yet.
Obviously, we can do that in $O(\log_{1+\epsilon} n)$ time by scanning the whole
list. This is too much, so we show how to spend only constant time. Let $p$ be
the rightmost occurrence of the symbol $T[r_i + 1]$ in $T[0, r_i]$.^ If
$p < l$, then $r_i + 1$ must be inserted in the front of $L_{T[r_i+1]}$ and we
are done. In fact, $p < l$ implies that there is no occurrence of $T[r_i + 1]$
in $T[l, r_i]$ and, thus, no position can precede $r_i + 1$ in $L_{T[r_i+1]}$.
Otherwise (i.e. $p \geq l$), we have that $p$ is in $L_{T[r_i+1]}$, because it
is the last occurrence of symbol $T[r_i + 1]$ for some window $w_j$ with
$j \leq i$. We observe that if $w_j = w_i$, then $p$ must be replaced by
$r_i + 1$ which is now the last occurrence of $T[r_i + 1]$ in $w_i$; otherwise
$r_i + 1$ must be inserted after $p$ in $L_{T[r_i+1]}$ because $p$ is still the
last occurrence of this symbol in the window $w_j$. We can decide which one is
the correct case by comparing $p$ and $r_{i-1}$ (i.e., the ending position of
the preceding window $w_{i-1}$). In any case, the list is kept updated in
constant time.
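
The linear-space solution described above can be sketched in Python as
follows. This is a minimal sketch under our own assumptions rather than a
reference implementation: positions are 0-based, B counts the occurrences in
the prefix that precedes position l, the lists $L_c$ are threaded through two
arrays indexed by text position, all names are hypothetical, and size returns
only the entropy term of Eqn. (1). The caller is assumed to keep the ending
positions non-decreasing, that is, Append(i) is called only when window i is
the last one or strictly shorter than window i+1.

```python
import math

def xlogx(a):
    return a * math.log2(a) if a > 0 else 0.0

class LinearSpaceWindows:
    def __init__(self, text, num_windows):
        n = len(text)
        self.T, self.l = text, 0
        self.r = [-1] * num_windows      # inclusive right ends, non-decreasing
        self.E = [0.0] * num_windows     # E_i = sum_c A_i[c] log A_i[c]
        self.B = {}                      # B[c]: occurrences of c in T[0 .. l-1]
        self.R = [0] * n                 # R[j]: occurrences of T[j] in T[0 .. j]
        self.prev_occ = [-1] * n         # previous occurrence of T[j], or -1
        count, last = {}, {}
        for j, c in enumerate(text):
            count[c] = count.get(c, 0) + 1
            self.R[j] = count[c]
            self.prev_occ[j] = last.get(c, -1)
            last[c] = j
        self.head = {}                   # head[c]: first node of list L_c
        self.nxt = [None] * n            # list links, indexed by text position
        self.prv = [None] * n

    def _link(self, c, after, j):
        """Insert position j into L_c right after node `after` (None = front)."""
        before = self.head.get(c) if after is None else self.nxt[after]
        self.prv[j], self.nxt[j] = after, before
        if after is None:
            self.head[c] = j
        else:
            self.nxt[after] = j
        if before is not None:
            self.prv[before] = j

    def _unlink(self, c, j):
        a, b = self.prv[j], self.nxt[j]
        if a is None:
            self.head[c] = b
        else:
            self.nxt[a] = b
        if b is not None:
            self.prv[b] = a
        self.prv[j] = self.nxt[j] = None

    def append(self, i):
        j = self.r[i] + 1
        c = self.T[j]
        a = self.R[j] - self.B.get(c, 0)          # A_i[c] after the append
        self.E[i] += xlogx(a) - xlogx(a - 1)
        p = self.prev_occ[j]
        if p < self.l:                            # no c inside window i so far
            if self.head.get(c) != j:
                self._link(c, None, j)
        else:
            if self.nxt[p] != j:
                self._link(c, p, j)
            if i == 0 or self.r[i - 1] < p:       # p is no window's last occurrence
                self._unlink(c, p)
        self.r[i] = j

    def remove(self):
        c = self.T[self.l]
        base = self.B.get(c, 0)
        t = self.head.get(c)
        for i in range(len(self.r)):
            if self.r[i] < self.l:                # empty window stays empty
                self.r[i] = self.l
                continue
            while self.nxt[t] is not None and self.nxt[t] <= self.r[i]:
                t = self.nxt[t]                   # last occurrence of c in window i
            a = self.R[t] - base                  # A_i[c] before the removal
            self.E[i] += xlogx(a - 1) - xlogx(a)
        if self.head.get(c) == self.l:
            self._unlink(c, self.l)
        self.B[c] = base + 1
        self.l += 1

    def size(self, i):
        return xlogx(self.r[i] - self.l + 1) - self.E[i]
```

In remove, the windows and the list of the evicted symbol are scanned together
in increasing order, so the cost is proportional to the number of windows plus
the length of that list.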

The following Lemma derives from the discussion above:

Lemma 2 Let $T[1, n]$ be a text drawn from an alphabet of size
$\sigma = \mathrm{poly}(n)$. If we estimate Size() via 0-th order entropy (as
detailed in Section 2), then we can design a dynamic data structure that takes
$O(n)$ space and supports the operations Remove in
$R(n) = O(\log_{1+\epsilon} n)$ time, and Append and Size in $L(n) = O(1)$ time.

In order to evict the cost of the model from the compressed output (see
Section 2), authors typically resort to zero-th order adaptive compressors which
do not store the symbols' frequencies, since they are computed incrementally
during the compression [16]. A similar approach can be used in this case to
achieve the same time and space bounds of Lemma 2. Here, we require that
Size($w_i$) $= |C_0^a(T[l, r_i])| = |T[l, r_i]| H_0^a(T[l, r_i])$. Recall that
with this type of compressors the model does not have to be stored. We use the
same tools above but we change the values stored in variables $E_i$ and the way
in which they are updated after a Remove or an Append.

^Notice that we can precompute and store the last occurrence of symbol $T[j + 1]$
in $T[1, j]$ for all $j$'s in linear time and space.

Observe that in this case we have that

$|C_0^a(T[l, r_i])| = |T[l, r_i]| H_0^a(T[l, r_i]) = \log((r_i - l + 1)!) - \sum_{c \in \Sigma} \log(n_c!)$

where $n_c$ is the number of occurrences of symbol $c$ in $T[l, r_i]$.
Therefore, if the variable $E_i$ stores the value
$\sum_{c \in \Sigma} \log(A_i[c]!)$, then we have that

$|T[l, r_i]| H_0^a(T[l, r_i]) = \log((r_i - l + 1)!) - E_i$.^

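After the two operations, we change $E_i$'s value in the following way:

1. Remove(): For any window $w_i$ we update $E_i$ by subtracting
$\log(A_i[T[l]])$, where $A_i[T[l]]$ is the count before the eviction. We also
increase $l$ by one.

2. Append($w_i$): We update $E_i$ by summing $\log A_i[T[r_i + 1]]$, where
$A_i[T[r_i + 1]]$ is the count after the insertion, and we increase $r_i$ by
one.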
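
The adaptive variant of the two updates can be sketched as follows. This is a
minimal Python sketch for a single window, under our own assumptions: it keeps
an explicit counter table for brevity (the counts could equally be obtained
from the arrays $R$ and $B$ of the previous solution), it uses base-2
logarithms, Remove is called only on a non-empty window, and all names are
hypothetical.

```python
import math
from collections import Counter

def log2_factorial(n):
    """log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def adaptive_size(s):
    """log((|s|)!) - sum_c log(n_c!), computed from scratch for comparison."""
    return log2_factorial(len(s)) - sum(log2_factorial(k) for k in Counter(s).values())

class AdaptiveWindow:
    """A single window T[l..r] with E = sum_c log2(A[c]!), updated in O(1)."""

    def __init__(self, text):
        self.T, self.l, self.r = text, 0, -1
        self.A = Counter()
        self.E = 0.0
        self.log_fact_len = 0.0          # log2((r - l + 1)!), kept incrementally

    def append(self):
        self.r += 1
        c = self.T[self.r]
        self.A[c] += 1
        self.E += math.log2(self.A[c])   # log(a!) - log((a-1)!) = log a
        self.log_fact_len += math.log2(self.r - self.l + 1)

    def remove(self):
        c = self.T[self.l]
        self.E -= math.log2(self.A[c])   # count before the eviction
        self.A[c] -= 1
        self.log_fact_len -= math.log2(self.r - self.l + 1)
        self.l += 1

    def size(self):
        return self.log_fact_len - self.E
```

Both updates rely only on the identity $\log(a!) - \log((a-1)!) = \log a$, so
no factorial is ever recomputed.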

By the discussion above and Theorem 2 we obtain:

Theorem 3 Given a text $T[1, n]$ drawn from an alphabet of size
$\sigma = \mathrm{poly}(n)$, we can find a $(1 + \epsilon)$-optimal partition of
$T$ with respect to a 0-th order (adaptive) compressor in
$O(n \log_{1+\epsilon} n)$ time and $O(n)$ space, where $\epsilon$ is any
positive constant.

We point out that these results can be applied to the compression booster
of [10] to quickly obtain an approximation of the optimal partition of BWT($T$).
This may be better than the algorithm of [10] both in time complexity, since
that algorithm took $O(n\sigma)$ time, and in compression ratio by a factor up
to $\Omega(\sqrt{\log n})$ (see the discussion in Section 1). The case of a
large alphabet (namely, $\sigma = \Omega(\mathrm{polylog}(n))$) is particularly
interesting whenever we consider either a word-based BWT [23] or the
XBW-transform over labeled trees [10]. We notice that our result is interesting
also for the Huffword compressor which is the standard choice for the storage
of Web pages [?]; here $\Sigma$ consists of the distinct words constituting the
Web-page collection.

5 On k-th order compressors

In this section we make one step further and consider the more powerful $k$-th
order compressors, for which there exist bounds for estimating the size of their
compressed output (see Section 2). Here Size($w_i$) must compute
$|C(T[l, r_i])|$ which is estimated by
$(r_i - l + 1) H_k(T[l, r_i]) + f_k(r_i - l + 1, |\Sigma_{T[l,r_i]}|)$, where
$|\Sigma_{T[l,r_i]}|$ denotes the number of different symbols in $T[l, r_i]$.

Let us denote with $T_q[1, n-q]$ the text whose $i$-th symbol $T_q[i]$ is equal
to the $q$-gram $T[i, i + q - 1]$. Actually, we can remap the symbols of $T_q$
to integers in $[1, n]$ without modifying its zero-th order entropy. In fact the
number of distinct $q$-grams occurring in $T_q$ is less than $n$, the length of
$T$. Thus $T_q$'s symbols take $O(\log n)$ bits and $T_q$ can be stored in
$O(n)$ space. This remapping takes linear time and space, whenever $\sigma$ is
polynomial in $n$.

^Notice that the value $\log((r_i - l + 1)!)$ can be stored in a variable and
updated in constant time since the value $r_i - l + 1$ changes just by one after
a Remove or an Append.
A simple calculation shows that the $k$-th order (adaptive) entropy of a string
(see its definition in Section 2) can be expressed as the difference between the
zero-th order (adaptive) entropy of its $(k+1)$-grams and its $k$-grams. This
suggests that we can use the solution of the previous section in order to
compute the zero-th order entropy of the appropriate substrings of $T_{k+1}$ and
$T_k$. More precisely, we use two instances of the data structure of Theorem 3
(one for $T_{k+1}$ and one for $T_k$), which are kept synchronized in the sense
that, when operations are performed on one data structure, then they are also
executed on the other.
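
The identity behind this reduction can be checked numerically with the
following Python sketch. It is only an illustration under our own assumptions:
the empirical $k$-th order entropy is taken as the sum, over all contexts $w$
of length $k$, of the zero-th order entropy of the symbols following $w$, and
on the $k$-gram side only the $k$-grams that are followed by a symbol are
counted. The function names are hypothetical.

```python
import math
from collections import Counter

def n_h0(symbols):
    """|s| * H_0(s) in bits, for any sequence of hashable symbols."""
    n = len(symbols)
    return sum(k * math.log2(n / k) for k in Counter(symbols).values())

def n_hk_direct(text, k):
    """Sum over contexts w of |w_T| * H_0(w_T); w_T collects the symbols following w."""
    followers = {}
    for i in range(len(text) - k):
        followers.setdefault(text[i:i + k], []).append(text[i + k])
    return sum(n_h0(f) for f in followers.values())

def n_hk_via_qgrams(text, k):
    """Zero-th order entropy of the (k+1)-grams minus that of the k-grams."""
    m = len(text) - k                                  # number of (k+1)-grams
    grams_k1 = [text[i:i + k + 1] for i in range(m)]
    grams_k = [text[i:i + k] for i in range(m)]        # k-grams followed by a symbol
    return n_h0(grams_k1) - n_h0(grams_k)
```

The two functions agree because each $(k+1)$-gram count is divided by the count
of its length-$k$ prefix once the two sums are subtracted.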

Lemma 3 Let $T[1, n]$ be a text drawn from an alphabet of size
$\sigma = \mathrm{poly}(n)$. If we estimate Size() via $k$-th order entropy (as
detailed in Section 2), then we can design a dynamic data structure that takes
$O(n)$ space and supports the operations Remove in
$R(n) = O(\log_{1+\epsilon} n)$ time, and Append and Size in $L(n) = O(1)$ time.

Essentially the same technique is applicable to the case of a $k$-th order
adaptive compressor $C$; in this case we keep up-to-date the 0-th order adaptive
entropies of the strings $T_{k+1}$ and $T_k$ (details in [?]).

Theorem 4 Given a text $T[1, n]$ drawn from an alphabet of size
$\sigma = \mathrm{poly}(n)$, we can find a $(1 + \epsilon)$-optimal partition of
$T$ with respect to a $k$-th order (adaptive) compressor in
$O(n \log_{1+\epsilon} n)$ time and $O(n)$ space, where $\epsilon$ is any
positive constant.

We point out that this result applies also to the practical case in which the 
base compressor C has a maximum (block) size B of data it can process at once 
(this is the typical scenario for gzip, bzip2, etc.). In this situation the time
performance of our solution reduces to $O(n \log_{1+\epsilon}(B \log \sigma))$.

6 On BWT-based compressors 

As we mentioned in Section 2 we know entropy-bounded estimates for the output
size of BWT-based compressors. So we could apply Theorem 4 to compute the
optimal partitioning of $T$ for such a type of compressors. Nevertheless, it is
also known [8] that such compression estimates are rough in practice because of
the features of the compressors that are applied to the BWT($T$)-string.
Typically, BWT is encoded via a sequence of simple compressors such as MTF
encoding, RLE encoding (which is optional), and finally a 0-order encoder like
Huffman or Arithmetic [28]. For each of these compression steps, a 0-th entropy
bound is known [22], but the combination of these bounds may turn out to be very
far from the final compressed size produced by the overall sequence of
compressors in practice [8].

In this section, we propose a solution to the optimal partitioning problem
for BWT-based compressors that introduces a $\Theta(\sigma \log n)$ slowdown in
the time complexity of Theorem 4, but with the advantage of computing the
$(1 + \epsilon)$-optimal solution with respect to the real compressed size, thus
without any estimation by any entropy-cost functions. Since in practice it is
$\sigma = \mathrm{polylog}(n)$, this slowdown should be negligible. In order to
achieve this result, we need to address a slightly different (but yet
interesting in practice) problem which is defined as follows. The input string
$T$ has the form $S[1]\#_1 S[2]\#_2 \cdots S[m]\#_m$ where each $S[i]$ is a text
(called page) drawn from an alphabet $\Sigma$, and $\#_1, \#_2, \ldots, \#_m$
are special characters greater than any symbol of $\Sigma$. A partition of $T$
must be page-aligned, that is it must form groups of contiguous pages
$S[i]\#_i \ldots S[j]\#_j$, denoted also $S[i, j]$. Our aim is to find a
page-aligned partition whose cost (as defined in Section 1) is at most
$(1 + \epsilon)$ times the minimum possible cost, for any fixed $\epsilon > 0$.
We notice that this problem generalizes the table partitioning problem [5],
since we can assume that $S[i]$ is a column of the table.

To simplify things we will drop the RLE encoding step of a BWT-based algorithm,
and defer the complete solution to the full version of this paper. We start by
noticing that a close analog of Theorem 2 holds for this variant of the optimal
partitioning problem, which implies that a $(1 + \epsilon)$-approximation of the
optimum cost (and the corresponding partition) can be computed using a data
structure supporting operations Append, Remove, and Size; with the only
difference that the windows $w_1, w_2, \ldots, w_m$ subject to the operations
are groups of contiguous pages of the form $w_i = S[l, r_i]$.

It goes without saying that there exist data structures designed to maintain
a dynamic text compressed with a BWT-based compressor under insertions and
deletions of symbols (see [13] and references therein). But they do not fit our
context for two reasons: (1) their underlying compressor is significantly
different from the scheme above; (2) in the worst case, they would spend linear
space per window, yielding a super-linear overall space complexity.

Instead of keeping a given window $w$ in compressed form, our approach
will only store the frequency distribution of the integers in the string
$w' = \mathrm{MTF}(\mathrm{BWT}(w))$ since this is enough to compute the
compressed output size produced by the final step of the BWT-based algorithm,
which is usually implemented via Huffman or Arithmetic [28]. Indeed, since MTF
produces a sequence of integers from 0 to $\sigma$, we can store their number of
occurrences for each window $w_i$ into an array $F_{w_i}$ of size $\sigma$. The
update of $F_{w_i}$ due to the insertion or the removal of a page in $w_i$
incurs two main difficulties: (1) how to update $w'_i$ as pages are
added/removed from the extremes of the window $w_i$, (2) how to perform this
update implicitly over $F_{w_i}$ because of the space reasons mentioned above.
Our solution relies on two key facts about BWT and MTF:

1. Since the pages are separated in $T$ by distinct separators, inserting or
removing one page into a window $w$ does not alter the relative lexicographic
order of the original suffixes of $w$ (see [13]).

2. If a string $s'$ is obtained from string $s$ by inserting or removing a char
$c$ into an arbitrary position, then MTF($s'$) differs from MTF($s$) in at most
$\sigma$ symbols. More precisely, if $c'$ is the next occurrence in $s$ of the
newly inserted (or removed) symbol $c$, then the MTF has to be updated only in
the first occurrence of each symbol of $\Sigma$ between $c$ and $c'$.
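
The second fact can be observed experimentally with the following Python
sketch. It is only an illustration under our own assumptions: the MTF list
starts from a fixed ordering of the alphabet, positions are 0-based, and the
helper names are hypothetical.

```python
def mtf_encode(s, alphabet):
    """Move-to-front ranks of s, starting from the list `alphabet`."""
    lst = list(alphabet)
    out = []
    for ch in s:
        rank = lst.index(ch)
        out.append(rank)
        lst.insert(0, lst.pop(rank))     # move the accessed symbol to the front
    return out

def changed_encodings(s, h, c, alphabet):
    """Number of symbols of s whose MTF rank changes when c is inserted at position h."""
    old = mtf_encode(s, alphabet)
    new = mtf_encode(s[:h] + c + s[h:], alphabet)
    assert new[:h] == old[:h]            # the prefix before h is untouched
    # old[h:] and new[h+1:] encode the same symbols of s, shifted by one position.
    return sum(1 for x, y in zip(old[h:], new[h + 1:]) if x != y)
```

For instance, changed_encodings("abcabdacbdabcd", 5, "c", "abcd") counts how
many of the original symbols receive a different rank once a "c" is inserted
at position 5; by the fact above this count never exceeds the alphabet size.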

Due to space limitations we defer the solution to Appendix B, and state
here the result we are able to achieve.

Theorem 5 Given a sequence of texts of total length $n$ and alphabet size
$\sigma = \mathrm{poly}(n)$, we can compute a $(1 + \epsilon)$-approximate
solution to the optimal partitioning problem for a BWT-based compressor, in
$O(n (\log_{1+\epsilon} n) \sigma \log n)$ time and
$O(n + \sigma \log_{1+\epsilon} n)$ space.
7 Conclusion

In this paper we have investigated the problem of partitioning an input string
$T$ in such a way that compressing individually its parts via a base-compressor
$C$ gets a compressed output that is shorter than applying $C$ over the entire
$T$ at once. We provide the first algorithm which is guaranteed to compute in
$O(n \log_{1+\epsilon} n)$ time a partition of $T$ whose compressed output is
at most $(1 + \epsilon)$ times the optimal one, where $\epsilon$ may be any positive
constant. As future directions of research we would like to investigate the
design of $o(n^2)$ algorithms for computing the exact optimal partition, and/or
to experiment and engineer our solution over large datasets.
Appendix A

An example for the booster

In this section we prove that there exists an infinite class of strings for
which the partition selected by booster is far from the optimal one by a factor
$\Omega(\sqrt{\log n})$. Consider an alphabet
$\Sigma = \{c_1, c_2, \ldots, c_\sigma\}$ and assume that
$c_1 < c_2 < \ldots < c_\sigma$. We divide it into $l = \sigma/\alpha$ groups of
$\alpha$ consecutive symbols each, where $\alpha > 0$ will be defined later. Let
$\Sigma_1, \Sigma_2, \ldots, \Sigma_l$ denote these sub-alphabets. For each
$\Sigma_i$, we build a De Bruijn sequence $T_i$ in which each pair of symbols of
$\Sigma_i$ occurs exactly once. By construction each sequence $T_i$ has length
$\alpha^2$. Then, we define $T = T_1 T_2 \cdots T_l$, so that
$|T| = \sigma\alpha$ and each symbol of $\Sigma$ occurs exactly $\alpha$ times
in $T$. Therefore, the first column of the BWT matrix is equal to
$(c_1)^\alpha (c_2)^\alpha \ldots (c_\sigma)^\alpha$. We denote with $L_c$ the
portion of BWT($T$) that has symbol $c$ as prefix in the BWT matrix. By
construction, if $c \in \Sigma_i$, we have that any $L_c$ has either one
occurrence of each symbol of $\Sigma_i$ or one occurrence of these symbols of
$\Sigma_i$ minus one plus one occurrence of some symbol of $\Sigma_{i-1}$ (or
$\Sigma_l$ if $i = 1$). In both cases, each $L_c$ has $\alpha$ symbols, which
are all distinct. Notice that by construction, the longest common prefix among
any two suffixes of $T$ is at most 1. Therefore, since the booster can partition
only using prefix-close contexts (see [10]), there are just three possible
partitions: (1) one substring containing all symbols of $L$, (2) one substring
per $L_c$, or (3) as many substrings as symbols of $L$. Assuming that the cost
of each model is at least $\log \sigma$ bits^, then the costs of all possible
booster's partitions are:

1. Compressing the whole $L$ at once has cost at least
$\sigma\alpha \log \sigma$ bits. In fact, all the symbols in $\Sigma$ have the
same frequency in $L$.

2. Compressing each string $L_c$ costs at least
$\alpha \log \alpha + \log \sigma$ bits, since each $L_c$ contains $\alpha$
distinct symbols. Thus, the overall cost for this partition is at least
$\sigma\alpha \log \alpha + \sigma \log \sigma$ bits.

3. Compressing each symbol separately has overall cost at least
$\sigma\alpha \log \sigma$ bits.

We consider the alternative partition, which is not achievable by the booster,
that subdivides $L$ into $\sigma/\alpha^2$ substrings denoted
$S_1, S_2, \ldots, S_{\sigma/\alpha^2}$ of size $\alpha^3$ symbols each (recall
that $|T| = \sigma\alpha$). Notice that each $S_i$ is drawn from an alphabet of
size smaller than $\alpha^3$.

The strings $S_i$ are compressed separately. The cost of compressing each
string $S_i$ is
$O(\alpha^3 \log \alpha^3 + \log \sigma) = O(\alpha^3 \log \alpha + \log \sigma)$.
Since there are $\sigma/\alpha^2$ strings $S_i$, the cost of this partition is
$P = O(\sigma\alpha \log \alpha + (\sigma/\alpha^2) \log \sigma)$. Therefore, by
setting $\alpha = \Theta(\sqrt{\log \sigma} / \log\log \sigma)$, we have that
$P = O(\sigma \sqrt{\log \sigma})$ bits. As far as the booster is concerned, the
best compression is achieved by its second partition whose cost is
$\Omega(\sigma \log \sigma)$ bits. Therefore, the latter is
$\Omega(\sqrt{\log \sigma})$ times larger than our proposed partition. Since
$\sigma \geq \sqrt{n}$, the ratio among the two partitions is
$\Omega(\sqrt{\log n})$.
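
The building block of this construction, a De Bruijn sequence over pairs of
symbols, can be generated as in the following Python sketch. The appendix does
not prescribe a specific construction; the sketch uses the standard recursive
generation based on Lyndon words, reads "each pair occurs exactly once" in the
cyclic sense, and uses a hypothetical function name.

```python
def de_bruijn_order2(alphabet):
    """Cyclic sequence in which every ordered pair over `alphabet` occurs exactly once."""
    k, n = len(alphabet), 2
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])   # emit a Lyndon word whose length divides n
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)
```

For a sub-alphabet of $\alpha$ symbols the returned string has length
$\alpha^2$, which matches the length of each $T_i$ used above.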

^Here we assume that it contains at least one symbol. Nevertheless, as we will see, the 
compression gap between booster's partition and the optimal one grows as the cost of the 
model becomes bigger. 
Appendix B

Proof of Theorem 5

We describe a data structure supporting operations Append($w$) and Remove()
when the base compressor is BWT-based, and the input text $T$ is the
concatenation of a sequence of pages $S[1], S[2], \ldots, S[m]$ separated by
unique separator symbols $\#_1, \#_2, \ldots, \#_m$, which are not part of
$\Sigma$ and are lexicographically larger than any symbol in $\Sigma$. We assume
that the separator symbols in BWT($T$) are ignored by the MTF step, which means
that when the MTF encoder finds a separator in BWT($T$), this is replaced with
the corresponding integer without altering the MTF-list. This variant does not
introduce any compression penalty (because every separator occurs just once) but
simplifies the discussion that follows. We denote with $sa_T[1, n]$ and
$isa_T[1, n]$ respectively the suffix array of $T$ and its inverse. Given a
range $I = [a, b]$ of positions of $T$, an occurrence of a symbol of BWT($T$) is
called active$_{[a,b]}$ if it corresponds to a symbol in $T[a, b]$. For any
range $[a, b] \subseteq [n]$ of positions in $T$, we define RBWT($T[a, b]$) as
the string obtained by concatenating the active$_{[a,b]}$ symbols of BWT($T$) by
preserving their relative order. In the following, we will not indicate the
interval when it is clear from the context. Notice that, due to the presence of
separators, RBWT($T[a, b]$) coincides with BWT($T[a, b]$) when $T[a, b]$ spans a
group of contiguous pages (see [15] and references therein). Moreover,
MTF(RBWT($T[a, b]$)) is the string obtained by performing the MTF algorithm on
RBWT($T[a, b]$). We will call the symbol MTF(RBWT($T[a, b]$))$[i]$ the
MTF-encoding of the symbol RBWT($T[a, b]$)$[i]$.

For each window $w$, our solution will not explicitly store either RBWT($w$)
or MTF(RBWT($w$)) since this might require a superlinear amount of space.
Instead, we maintain only an array $F_w$ of size $\sigma$ whose entry $F_w[e]$
keeps the number of occurrences of the encoding $e$ in MTF(RBWT($w$)). The
array $F_w$ is enough to compute the 0-order entropy of MTF(RBWT($w$)) in
$O(\sigma)$ time (or possibly the exact cost of compressing it with Huffman in
$O(\sigma \log \sigma)$ time).
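
Both costs can indeed be derived from the frequency array alone, as the
following Python sketch illustrates. It is a minimal sketch under our own
assumptions: freq plays the role of $F_w$, the function names are hypothetical,
the Huffman cost counts only the codeword bits and not the code table, and a
window containing a single distinct encoding is charged one bit per occurrence.

```python
import heapq
import math

def entropy_cost_bits(freq):
    """n * H_0 of the MTF string, computed from its frequency array in O(sigma) time."""
    n = sum(freq)
    return sum(f * math.log2(n / f) for f in freq if f > 0)

def huffman_cost_bits(freq):
    """Total length in bits of the Huffman encoding, in O(sigma log sigma) time."""
    heap = [f for f in freq if f > 0]
    if len(heap) == 1:
        return heap[0]                   # assumption: one bit per occurrence
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b                   # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return total
```

The Huffman cost is obtained as the sum of the weights of the internal nodes
of the code tree, so the tree itself never has to be built.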

We are left with showing how to correctly keep $F_w$ updated after a Remove()
or an Append($w$). In the following we will concentrate only on Append($w$)
since Remove() is symmetrical. The idea underlying the implementation of
Append($w$), where $w = S[l, r]$, is to conceptually insert the symbols of the
next page $S[r + 1]$ into RBWT($w$) one at a time from left to right. Since the
relative order among the symbols of RBWT($w$) is preserved in BWT($T$), it is
more convenient to work with active symbols of BWT($T$) by resorting to a data
structure, whose details are given later, which is able to efficiently answer
the following two queries with parameters $c$, $I$ and $h$, where
$c \in \Sigma$, $I = [a, b]$ is a range of positions in $T$ and $h$ is a
position in BWT($T$):

• Prev$_c(I, h)$: locate the last active$_{[a,b]}$ occurrence in
BWT($T$)$[0, h - 1]$ of symbol $c$;

• Next$_c(I, h)$: locate the first active$_{[a,b]}$ occurrence in
BWT($T$)$[h + 1, n]$ of symbol $c$.

This data structure is built over the whole text $T$ and requires $O(|T|)$ space.

Let $c$ be the symbol of $S[r + 1]$ we have to conceptually insert in
RBWT($T[a, b]$). We can compute the position (say, $h$) of this symbol in
BWT($T$) by resorting to the inverse suffix array of $T$. Once we know position
$h$, we have to determine what changes in MTF(RBWT($w$)) the insertion of $c$
has produced and update $F_w$ accordingly. It is not hard to convince ourselves
that the insertion of symbol $c$ changes no more than $\sigma$ encodings in
MTF(RBWT($w$)). In fact, only the first active occurrence of each symbol in
$\Sigma$ after position $h$ may change its MTF encoding. More precisely, let
$h_p$ and $h_n$ be respectively the last active occurrence of $c$ before $h$ and
the first active occurrence of $c$ after $h$ in BWT($w$), then the first active
occurrence of a symbol after $h$ changes its MTF encoding if and only if it
occurs active both in BWT($w$)$[h_p, h]$ and in BWT($w$)$[h, h_n]$. Otherwise,
the new occurrence of $c$ has no effect on its MTF encoding. Notice that $h_p$
and $h_n$ can be computed via proper queries Prev$_c$ and Next$_c$. In order to
correctly update $F_w$, we need to recover for each of the above symbols their
old and new encodings. The first step consists of finding the last active
occurrence before $h$ of each symbol in $\Sigma$ using Prev queries. Once we
have these positions, we can recover the status of the MTF list, denoted
$\lambda$, before encoding $c$ at position $h$. This is simply obtained by
sorting the symbols by decreasing position. In the second step, for each
distinct symbol that occurs active in BWT($w$)$[h_p, h]$, we find its first
active occurrence in BWT($w$)$[h, h_n]$. Knowing $\lambda$ and these occurrences
sorted by increasing position, we can simulate the MTF algorithm to find the old
and new encodings of each of those symbols.

This provides an algorithm to perform Append($w$) by making $O(\sigma)$ queries
of types Prev and Next for each symbol of the page to append in $w$. To complete
the proof of the time bounds in Theorem 5 we have to show how to support
queries of type Prev and Next in $O(\log n)$ time and $O(n)$ space. This is
achieved by a straightforward reduction to a classic geometric range-searching
problem. Given a set of points
$P = \{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$ from the set
$[n] \times [n]$ (notice that $n$ can be larger than $p$), such that no pair of
points shares the same $x$- or $y$-coordinate, there exists a data structure
[20] requiring $O(p)$ space and supporting the following two queries in
$O(\log p)$ time:

• rangemax$([l, r], h)$: return among the points of $P$ contained in
$[l, r] \times [-\infty, h]$ the one with maximum $y$-value;

• rangemin$([l, r], h)$: return among the points of $P$ contained in
$[l, r] \times [h, +\infty]$ the one with minimum $y$-value.

Initially we compute $isa_T$ and $sa_T$ in $O(n \log \sigma)$ time; then, for
each symbol $c \in \Sigma$, we define $P_c$ as the set of points
$\{(i, isa_T[i + 1]) \mid T[i] = c\}$ and build the above geometric
range-searching structure on $P_c$. It is easy to see that Prev$_c(I, h)$ can be
computed in $O(\log n)$ time by calling rangemax$(I, isa_T[h + 1])$ on the set
$P_c$, and the same holds for Next$_c$ by using rangemin instead of rangemax;
this completes the reduction and the proof of the theorem.
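
To make the semantics of the two queries concrete, the following Python sketch
gives a brute-force reference version of Prev and Next, together with the same
Prev answer obtained through the point set $P_c$. It is only an illustration
under our own assumptions: it scans linearly instead of using the $O(\log n)$
range-searching structure of [20], positions and rows are 0-based, the query
parameter h is taken directly as a row of BWT($T$), the suffix array is built
by plain sorting, and all names are hypothetical.

```python
def bwt_with_sa(text):
    """Suffix array by plain sorting, and BWT[row] = text[sa[row] - 1] (cyclically)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return sa, [text[i - 1] for i in sa]

def is_active(text, sa, row, c, a, b):
    pos = (sa[row] - 1) % len(text)      # text position of the symbol stored in this row
    return text[pos] == c and a <= pos <= b

def prev_c(text, sa, c, a, b, h):
    """Largest row h' < h holding an active occurrence of c, or None."""
    for row in range(h - 1, -1, -1):
        if is_active(text, sa, row, c, a, b):
            return row
    return None

def next_c(text, sa, c, a, b, h):
    """Smallest row h' > h holding an active occurrence of c, or None."""
    for row in range(h + 1, len(text)):
        if is_active(text, sa, row, c, a, b):
            return row
    return None

def prev_c_points(text, sa, c, a, b, h):
    """Same answer as prev_c through the points P_c = {(i, isa[i+1]) : T[i] = c}."""
    n = len(text)
    isa = [0] * n
    for row, i in enumerate(sa):
        isa[i] = row
    ys = [isa[(i + 1) % n] for i in range(a, b + 1) if text[i] == c]
    ys = [y for y in ys if y < h]        # brute-force stand-in for rangemax
    return max(ys) if ys else None
```

A real implementation would replace the list comprehension in prev_c_points
with a rangemax query on a structure built once over $P_c$.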